refactor: rewrite parser core as explicit Block tree #285
Hocnonsense wants to merge 55 commits into master from
Conversation
Important: Review skipped (draft detected). Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration.
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master     #285      +/-   ##
===========================================
- Coverage   96.52%   83.13%   -13.39%
===========================================
  Files          12       13        +1
  Lines        1497     2485      +988
  Branches      309      555      +246
===========================================
+ Hits         1445     2066      +621
- Misses         24      295      +271
- Partials       28      124       +96
```
Flags with carried forward coverage won't be shown.
|
Actionable comments posted: 2
🧹 Nitpick comments (10)
snakefmt/blocken.py (7)
1859-1866: Parameter `input` shadows Python builtin. Consider renaming to `source` or `content` to avoid shadowing the builtin.

📝 Suggested fix

```diff
-def parse(input: str | Callable[[], str], name: str = "<string>"):
-    if isinstance(input, str):
+def parse(source: str | Callable[[], str], name: str = "<string>"):
+    if isinstance(source, str):
         tokens = tokenize.generate_tokens(
-            iter(input.splitlines(keepends=True)).__next__
+            iter(source.splitlines(keepends=True)).__next__
         )
     else:
-        tokens = tokenize.generate_tokens(input)
+        tokens = tokenize.generate_tokens(source)
     return GlobalBlock(0, TokenIterator(name, tokens), [])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` around lines 1859 - 1866, The parse function currently names its parameter input which shadows the Python builtin; rename the parameter to a non-shadowing name like source (or content) in the parse definition and update all references inside the function (the isinstance check, the tokenize.generate_tokens call and the TokenIterator invocation) to use the new name, and then update any callers of parse to pass the renamed parameter positionally or by the new name if keyword-used.
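As a runnable sketch of the suggested rename (the helper name `generate_tokens_from` is invented for this illustration and is not part of snakefmt's API):

```python
import tokenize

def generate_tokens_from(source: str):
    """Tokenize a source string without shadowing the `input` builtin."""
    # Same feeding pattern as in the diff above: one source line per call.
    return list(
        tokenize.generate_tokens(iter(source.splitlines(keepends=True)).__next__)
    )

toks = generate_tokens_from("x = 1\n")
```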
1086: Typo in comment: "olgorithm" should be "algorithm".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` at line 1086, Fix the typo in the inline comment that reads "please do not blame if the olgorithm is slow :)" by changing "olgorithm" to "algorithm" so the comment becomes "please do not blame if the algorithm is slow :)"; locate the comment text in snakefmt/blocken.py and update it accordingly.
57-65: Redundant assignment on line 61. `self.tokens = tokens` appears redundant since `self._live_tokens = tokens` is already assigned on line 59. If `self.tokens` is intended for external access, consider documenting its purpose or removing it if unused.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` around lines 57 - 65, The assignment self.tokens = tokens in __init__ is redundant with self._live_tokens; remove the redundant self.tokens assignment or, if external access is required, replace it with a documented property that returns self._live_tokens (e.g., implement a tokens `@property`) so all uses reference a single source of truth; update any external usages of the attribute to use the chosen API (either the property name tokens or the internal _live_tokens) and add a short docstring to clarify intended external access.
1800-1802: Typo: "sort_direcives" should be "sort_directives". The attribute name has a consistent typo throughout the codebase. While it works, fixing it would improve code readability.

The typo also appears at:
- Line 329 in `FormatState`
- Line 1821 in `get_formatted`
- Line 1881 in `setup_formatter`

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` around lines 1800 - 1802, Rename the misspelled attribute sort_direcives to sort_directives everywhere to improve readability: update the __slots__ tuple ("mode", "sort_direcives") to ("mode", "sort_directives"), rename the attribute annotations (mode: Mode; sort_direcives: bool -> sort_directives: bool) and change all references/assignments/usages in FormatState, get_formatted, and setup_formatter to use sort_directives; ensure any constructors, attribute accesses, and tests are updated to the new name and run linters/tests to catch any remaining references.
1833-1842: Typo: "place_hode" should be "placeholder". The variable names `place_hode_str` and `place_hode` appear to be typos for "placeholder".

📝 Suggested fix

```diff
-    place_hode_str = "o" * 50
+    placeholder_str = "o" * 50
     raw_str = "".join(python_codes)
-    while place_hode_str in raw_str:
-        place_hode_str *= 2
+    while placeholder_str in raw_str:
+        placeholder_str *= 2
     raw_str = "#\n"
     for python_code, (snakemake_code, indent) in zip(python_codes, snakemake_codes):
         if snakemake_code.count("\n") == 1:  # must at the end of line
-            place_hode = f"{indent}def l{place_hode_str}1ng(): ...\n"
+            placeholder = f"{indent}def l{placeholder_str}1ng(): ...\n"
         else:
-            place_hode = f"{indent}def l{place_hode_str}ng():\n{indent}    return\n"
-        raw_str += python_code + place_hode
+            placeholder = f"{indent}def l{placeholder_str}ng():\n{indent}    return\n"
+        raw_str += python_code + placeholder
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` around lines 1833 - 1842, The variables place_hode_str and place_hode in the block inside the function that builds raw_str (the loop using python_codes and snakemake_codes) are typos and should be renamed to placeholder_str and placeholder respectively; update all usages including the initial assignment, the while loop that multiplies the string, and the f-strings that reference them (e.g., f"{indent}def l{place_hode_str}1ng(): ..." and f"{indent}def l{place_hode_str}ng():") so they become f"{indent}def l{placeholder_str}1ng(): ..." and f"{indent}def l{placeholder_str}ng():", and ensure you rename the temporary variable place_hode to placeholder where it’s assigned and used.
96-99: Minor typos in docstring. "Donot" should be "Do not" and "INDEDT" should be "INDENT".

📝 Suggested fix

```diff
 def next_block(self):
     """Returns a entire block, just consume until the end of the block.
-    Donot care if there are nested blocks inside or snakemake keywords inside.
+    Do not care if there are nested blocks inside or snakemake keywords inside.
-    it could be INDEDT -> [any content] -> DEDENT, or [any content] -> DEDENT
+    it could be INDENT -> [any content] -> DEDENT, or [any content] -> DEDENT
     """
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` around lines 96 - 99, Fix the typos in the docstring in snakefmt/blocken.py that documents the routine which "Returns a entire block" (the docstring describing consuming until the end of a block); change "Donot" to "Do not" and "INDEDT" to "INDENT", and while here correct "a entire block" to "an entire block" for grammar consistency so the docstring reads clearly.
443-448: Error message note may be misleading when line number is available. The "line number may be incorrect" note is always appended, even when `start_token` is provided and the line number is adjusted. Consider only adding this note when the line number couldn't be determined.

📝 Suggested fix

```diff
     else:
         err_msg = str(e)
-    err_msg += (
-        "\n\n(Note reported line number may be incorrect, as"
-        " snakefmt could not determine the true line number)"
-    )
+    if start_token is None or not match:
+        err_msg += (
+            "\n\n(Note reported line number may be incorrect, as"
+            " snakefmt could not determine the true line number)"
+        )
     err_msg = f"Black error:\n```\n{err_msg}\n```\n"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` around lines 443 - 448, The error text always appends the ambiguous line-number note before wrapping and raising InvalidPython; change the logic in the block that builds err_msg (the code that presently appends "\n\n(Note reported line number may be incorrect...)" and then wraps err_msg and raises InvalidPython) to only add that ambiguous-line-number note when the start_token (or whatever marker is used to compute an adjusted line number) is not available or the line could not be determined; keep the existing formatting that wraps err_msg in "Black error:\n```\n...\n```\n" and then raise InvalidPython(err_msg) from None but gate the extra note on start_token being falsy/None (or on the failure case that determines line-number uncertainty).

tests/test_blocken.py (1)
29-32: Consider renaming parameter to avoid shadowing builtin. The parameter `input` shadows the Python builtin. While harmless in this limited scope, using `source` or `code` would be cleaner.

📝 Suggested fix

```diff
-def generate_tokens(input: str):
+def generate_tokens(source: str):
     return list(
-        tokenize.generate_tokens(iter(input.splitlines(keepends=True)).__next__)
+        tokenize.generate_tokens(iter(source.splitlines(keepends=True)).__next__)
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_blocken.py` around lines 29 - 32, The parameter name `input` in generate_tokens shadows the Python builtin; rename it (e.g., to `source` or `code`) in the function signature and update all references inside the function (the call to input.splitlines and any callers in tests) to use the new name so the tokenizer still receives the same string argument.

tests/test_formatter.py (1)
2444: Unused variable `formatted0`. The unpacked variable `formatted0` is never used in this test. Use an underscore prefix to indicate it's intentionally ignored.

📝 Suggested fix

```diff
-    code1, formatted0 = TestSortFormatting.sorting_comprehensive
+    code1, _formatted0 = TestSortFormatting.sorting_comprehensive
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_formatter.py` at line 2444, The test unpacks TestSortFormatting.sorting_comprehensive into code1, formatted0 but never uses formatted0; change the unused variable to a deliberately ignored name (e.g., _formatted0 or _) in the assignment where TestSortFormatting.sorting_comprehensive is unpacked so the intent is clear and linters won't flag formatted0 as unused.

.flake8 (1)
3-4: Comment is now outdated. The comment says "the default ignores minus E704", but E704 is now explicitly added to the ignore list. Consider updating the comment to reflect the current state.

📝 Suggested fix

```diff
-# the default ignores minus E704
-ignore = E121,E123,E126,E226,E203,E24,E701,E704,W503,W504
+# the default ignores plus E701 and E704 for single-line class/def patterns
+ignore = E121,E123,E126,E226,E203,E24,E701,E704,W503,W504
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.flake8 around lines 3 - 4, The comment on the .flake8 file is stale: update or remove the phrase "the default ignores minus E704" to reflect that E704 is now explicitly included in the ignore list; edit the comment above the ignore = E121,E123,E126,E226,E203,E24,E701,E704,W503,W504 line (or replace it) to accurately describe the current ignore configuration or remove the misleading text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakefmt/blocken.py`:
- Around line 233-235: The condition is comparing self.body[0].type (an int) to
the string "**" which is always false; update the check in the method using
self.body[0] to verify both the token type and the token text — e.g., test that
self.body[0].type equals the appropriate token type constant and that
self.body[0].string (or .value) == "**" (i.e., replace the current if
self.body[0].type == "**" with a combined type-and-string check using
self.body[0].type and self.body[0].string).
- Around line 357-358: Remove the redundant pre-assignment to the variable
`match` before the walrus check: delete the line `match =
_FMT_DIRECTIVE_RE.match(comment)` so the code only uses the walrus expression
`if match := _FMT_DIRECTIVE_RE.match(comment):` in blocken.py; this eliminates
the overwritten assignment and preserves the intended control flow around
`_FMT_DIRECTIVE_RE.match`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c9fb18f7-ef63-4e85-a5c7-5d804a776e30
📒 Files selected for processing (4)
.flake8
snakefmt/blocken.py
tests/test_blocken.py
tests/test_formatter.py
Actionable comments posted: 2
🧹 Nitpick comments (6)
snakefmt/blocken.py (5)
446: Redundant `str()` call. `err_msg` is already a string at this point. The `str()` conversion is unnecessary:

```diff
-    err_msg = f"Black error:\n```\n{str(err_msg)}\n```\n"
+    err_msg = f"Black error:\n```\n{err_msg}\n```\n"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` at line 446, The f-string that builds the Black error message unnecessarily calls str() on err_msg even though err_msg is already a string; replace the f-string expression "{str(err_msg)}" with "{err_msg}" so the line constructing err_msg (the variable err_msg used to format the Black error block) uses the value directly without redundant conversion.
532-534: Mutable class attribute defaults flagged by static analysis. Ruff warns about mutable default values for class attributes (RUF012). While these appear to be intentionally shared registries, consider using `ClassVar` type hints to make the intent explicit:

```python
from typing import ClassVar

class Block(ABC):
    subautomata: ClassVar[Mapping[str, "type[ColonBlock]"]] = {}
    deprecated: ClassVar[Mapping[str, str]] = {}
```

This also applies to similar patterns at lines 908, 959-960, 1725, 1785, and 1804.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` around lines 532 - 534, The class-level mutable mappings subautomata and deprecated should be annotated as ClassVar to indicate intentional shared registries and silence RUF012; update their type hints to use ClassVar (e.g., subautomata: ClassVar[Mapping[str, "type[ColonBlock]"]] and deprecated: ClassVar[Mapping[str, str]]), add the necessary typing import if missing, and apply the same change to the other occurrences referenced (around the symbols at the other noted locations) so all mutable class defaults are explicitly ClassVar-annotated.
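A minimal runnable sketch of the `ClassVar` pattern, with an invented `Registry` class standing in for the real `Block` registries:

```python
from typing import ClassVar

class Registry:
    # Intentionally shared across all instances; ClassVar documents that
    # this is a class-level registry, not a per-instance default.
    handlers: ClassVar[dict[str, str]] = {}

    def register(self, key: str, value: str) -> None:
        Registry.handlers[key] = value

a, b = Registry(), Registry()
a.register("rule", "RuleBlock")
# Both instances see the same mapping because it lives on the class.
shared = b.handlers["rule"]
```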
360: Minor: Prefer `next()` over single-element slice. Per Ruff RUF015, consider using `next()` for extracting the first element:

```diff
-    option = [opt.strip() for opt in (options or "").split(",")][0]
+    option = next(opt.strip() for opt in (options or "").split(","))
```

This is more idiomatic and avoids creating an intermediate list.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` at line 360, Replace the single-element list comprehension used to set the variable option with an iterator + next() to avoid building an intermediate list: change the assignment to use next((opt.strip() for opt in (options or "").split(",")), "") so it yields the first stripped element (or a safe default) — update the line that currently sets option = [opt.strip() for opt in (options or "").split(",")][0] in snakefmt/blocken.py.
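A small self-contained comparison of the two forms (sample values invented for illustration):

```python
options = " a , b , c "
# The list comprehension builds all three stripped options, then discards two.
first_via_list = [opt.strip() for opt in options.split(",")][0]
# next() on a generator stops after producing the first element.
first_via_next = next(opt.strip() for opt in options.split(","))

# Note: "".split(",") still yields one empty item, so the default form
# suggested in the AI prompt is only a safety net for truly empty iterables.
empty_first = next((opt.strip() for opt in "".split(",")), "")
```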
1833-1837: Confusing variable reuse: `raw_str` assigned then immediately overwritten. The code assigns `raw_str = "".join(python_codes)` at line 1834, uses it for collision checking, then immediately overwrites it at line 1837 with `raw_str = "#\n"`. While functionally correct (the collision check uses the first value), this is confusing. Consider using a distinct variable name for clarity:

```diff
     placeholder = "o" * 50
-    raw_str = "".join(python_codes)
-    while placeholder in raw_str:
+    all_python = "".join(python_codes)
+    while placeholder in all_python:
         placeholder *= 2
     raw_str = "#\n"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` around lines 1833 - 1837, The variable raw_str is reused confusingly: it’s set to "".join(python_codes) for the collision check then immediately overwritten with "#\n"; rename the first usage to a distinct variable (e.g., joined_python or python_blob) and use that in the while placeholder in <joined_variable> loop, then keep raw_str = "#\n" as the intended subsequent value; update references in the collision-checking loop (placeholder doubling) to use the new joined variable and leave raw_str only for the "#\n" assignment.
1838-1852: Add `strict=True` to `zip()` calls to catch length mismatches. The `zip()` calls on lines 1838 and 1847 pair `python_codes` with `snakemake_codes`. If there's ever a bug in the iteration logic that causes these lists to have different lengths, data would be silently dropped:

```diff
-    for python_code, (snakemake_code, indent) in zip(python_codes, snakemake_codes):
+    for python_code, (snakemake_code, indent) in zip(python_codes, snakemake_codes, strict=True):
```

And:

```diff
-    for formatted, (snakemake_code, _) in zip(formatted_split, snakemake_codes):
+    for formatted, (snakemake_code, _) in zip(formatted_split, snakemake_codes, strict=True):
```

This will raise `ValueError` if the lists have different lengths, making bugs easier to detect.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakefmt/blocken.py` around lines 1838 - 1852, The two zip() usages pairing python_codes and snakemake_codes should be made strict to surface length mismatches: update the first loop that iterates "for python_code, (snakemake_code, indent) in zip(python_codes, snakemake_codes):" and the second loop "for formatted, (snakemake_code, _) in zip(formatted_split, snakemake_codes):" to use zip(..., strict=True). This ensures mismatched lengths raise ValueError during the preparation of raw_str/py proxies and during reconstruction of final_str after format_black, helping catch bugs early.tests/test_formatter.py (1)
2396-2401: `# fmt: skip` in directives produces unusual output. The expected output shows `x = [ 1,2,3]  # fmt: skip` followed by a trailing comma on its own line (TAB * 2 + ",\n"). This is noted in the comment as a limitation ("currently `# fmt: skip` in directives is not supported").

Consider whether this behavior should be:
- Fixed to properly handle `# fmt: skip` inline
- Documented as a known limitation in user-facing docs
- Tracked as a future enhancement issue

Would you like me to open an issue to track improving `# fmt: skip` support in directives?
Verify each finding against the current code and only fix it if needed. In `@tests/test_formatter.py` around lines 2396 - 2401, The test shows a directive line containing "# fmt: skip" (seen in the test case using formatter.get_formatted()) but the formatter leaves a trailing comma on the next line; update the directive-handling logic (e.g., the Formatter method that processes directives — look for functions named like process_directives, _apply_directives, or _should_skip_range) to treat an inline "# fmt: skip" as skipping the entire directive token sequence (including following separator tokens such as trailing commas) so no lone comma line is emitted; adjust the formatter's token-range skip logic to remove or absorb following punctuation when a directive applies, add a focused unit test mirroring this example, and if you prefer not to change behavior add/update docs/tests to explicitly record this limitation instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakefmt/blocken.py`:
- Around line 617-623: The decorator-handling loop can hang if no recognized
block follows; wrap the calls to tokens.next_new_line() inside a try/except to
catch the TokenIterator/UnsupportedSyntax EOF case (or otherwise check
tokens.has_next()/peek() before calling next_new_line), and on EOF either raise
a clearer error or break out and handle the accumulated headers (call append_sub
with an appropriate fallback or abort with a descriptive exception). Update the
block around line handling where you detect line.body[0].string == "@" to use
self.recognize, tokens.next_new_line, and append_sub with this
EOF/unsupported-syntax guard so the loop cannot run forever.
- Around line 233-235: Update the condition that detects unpacking operators on
the parameter token (currently checking self.body[0].string == "**") to also
treat the single-star unpacking operator as a new parameter start; e.g., change
the test to check for self.body[0].string == "*" or self.body[0].string == "**"
(or use membership like self.body[0].string in ("*", "**")) so both *args and
**kwargs trigger the early-return that starts a new parameter line.
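The combined type-and-string check described above can be sketched as follows; `starts_unpacking` and `first_token` are hypothetical helper names for illustration, not snakefmt's code:

```python
import io
import tokenize

def starts_unpacking(tok: tokenize.TokenInfo) -> bool:
    # Both *args and **kwargs begin with an OP token; compare the token's
    # string (not its int type) to catch either star form.
    return tok.type == tokenize.OP and tok.string in ("*", "**")

def first_token(src: str) -> tokenize.TokenInfo:
    return next(tokenize.generate_tokens(io.StringIO(src).readline))

star = starts_unpacking(first_token("*args\n"))        # True
double_star = starts_unpacking(first_token("**kwargs\n"))  # True
plain = starts_unpacking(first_token("a=1\n"))         # False
```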
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d5a8390e-0306-4f92-834b-c45b26ea794b
📒 Files selected for processing (2)
snakefmt/blocken.py
tests/test_formatter.py
```python
if self.body[0].string == "**":
    return True
return False
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Verify how *args is handled in directive blocks
rg -n "is_keyword_line" --type=py -A3 -B3
```

Repository: snakemake/snakefmt
Length of output: 1564
🏁 Script executed:
```shell
# Get the full implementation of is_keyword_line property
sed -n '220,250p' snakefmt/blocken.py | cat -n
```

Repository: snakemake/snakefmt
Length of output: 1182
🏁 Script executed:
```shell
# Search for test cases or documentation about this property
rg -n "is_keyword_line\|keyword_line" --type=py -B2 -A5 | head -100
```

Repository: snakemake/snakefmt
Length of output: 44
🏁 Script executed:
```shell
# Look for patterns of * and ** handling in the codebase
rg -n 'string.*\*|\*\*' --type=py snakefmt/blocken.py | head -30
```

Repository: snakemake/snakefmt
Length of output: 480
🏁 Script executed:
```shell
# Check if there are parameter-related tests
fd -name "*test*" --type=f | xargs grep -l "keyword_line\|\*args\|\*\*kwargs" 2>/dev/null | head -5
```

Repository: snakemake/snakefmt
Length of output: 292
🏁 Script executed:
```shell
# Get more context about how is_keyword_line is used in the code
rg -n "is_keyword_line" --type=py -B5 -A10 snakefmt/blocken.py
```

Repository: snakemake/snakefmt
Length of output: 2440
🏁 Script executed:
```shell
# Look for any comments or documentation about keyword lines vs positional arguments
rg -n "keyword.*line|positional|args|kwargs" --type=py snakefmt/blocken.py | head -50
```

Repository: snakemake/snakefmt
Length of output: 1820
🏁 Script executed:
```shell
# Check if there's a test file for blocken.py
find . -type f -name "*test*blocken*" 2>/dev/null
```

Repository: snakemake/snakefmt
Length of output: 85
🏁 Script executed:
```shell
# Check test cases for is_keyword_line
rg -n "is_keyword_line|\*args|\*\*kwargs" ./tests/test_blocken.py -B3 -A3
```

Repository: snakemake/snakefmt
Length of output: 44
🏁 Script executed:
# Look for examples of parameter handling in tests
rg -n "keyword.*line|positional.*arg" ./tests/test_blocken.py -B2 -A5Repository: snakemake/snakefmt
Length of output: 44
🏁 Script executed:
# Search for docstrings or comments explaining keyword vs positional args
rg -n "keyword|positional|\*args|\*\*" snakefmt/blocken.py -B1 -A1 | grep -E "(#|\"\"\"|-\s)" | head -30Repository: snakemake/snakefmt
Length of output: 1649
🏁 Script executed:
# Check if there's any test data that shows parameter handling
fd -name "*.py" -path "*/tests/*" -exec rg -l "Argument\|keyword.*arg|positional" {} \;Repository: snakemake/snakefmt
Length of output: 292
🏁 Script executed:
# Look for actual usage examples of parameters in test files
rg -n "\*args|\*\*kwargs" ./tests/ -B2 -A2 | head -60Repository: snakemake/snakefmt
Length of output: 1596
🏁 Script executed:
# Check if there's documentation about the Argument class and parameter semantics
rg -n "class Argument" snakefmt/blocken.py -A20Repository: snakemake/snakefmt
Length of output: 44
🏁 Script executed:
# Look for any comments that explain why ** is treated as keyword but maybe * isn't
rg -n "keyword.*\*\*|positional.*\*" snakefmt/blocken.py -B2 -A2Repository: snakemake/snakefmt
Length of output: 44
🏁 Script executed:

```bash
# Check what tokenize.OP represents
python3 << 'EOF'
import tokenize
import io

# Test what * and ** are tokenized as
test_cases = ["*args", "**kwargs", "a=1"]
for case in test_cases:
    tokens = list(tokenize.generate_tokens(io.StringIO(case).readline))
    print(f"\n{case}:")
    for tok in tokens:
        if tok.type != tokenize.ENCODING and tok.type != tokenize.ENDMARKER:
            print(f"  type={tokenize.tok_name[tok.type]}, string={repr(tok.string)}")
EOF
```

Length of output: 337

🏁 Script executed:

```bash
# Check how parameters are actually parsed in real function definitions
rg -n "def.*\*args|\*\*kwargs" snakefmt/ --type=py -A5 | head -40
```

Length of output: 2277

🏁 Script executed:

```bash
# Look at the actual usage context more carefully
sed -n '1140,1230p' snakefmt/blocken.py | cat -n
```

Length of output: 5104

🏁 Script executed:

```bash
# Let's understand the distinction better - what makes something a "keyword line"
# Check the context around how positional vs keyword args are processed
sed -n '1175,1190p' snakefmt/blocken.py | cat -n
```

Length of output: 1139

🏁 Script executed:

```bash
# Look at the comment explaining positional arguments
sed -n '1105,1115p' snakefmt/blocken.py | cat -n
```

Length of output: 425
🏁 Script executed:

```bash
# Verify: in Python function definitions, is *args considered "keyword" or "positional"?
python3 << 'EOF'
import inspect

def example_func(a, b=2, *args, c=3, **kwargs):
    pass

sig = inspect.signature(example_func)
for param_name, param in sig.parameters.items():
    kind_name = param.kind.name
    print(f"{param_name}: kind={kind_name}")
    if param.default != inspect.Parameter.empty:
        print(f"  has default: {param.default}")
EOF
```

Length of output: 227

🏁 Script executed:

```bash
# Check if there are any tests that distinguish between * and **
rg -n "VAR_POSITIONAL|VAR_KEYWORD" snakefmt/ --type=py
```

Length of output: 44

🏁 Script executed:

```bash
# Verify the concern: what happens if *args line follows a positional arg
# Check if there are any actual test cases with *args
rg -n "\*args" snakefmt/blocken.py -B2 -A2
```

Length of output: 448

🏁 Script executed:

```bash
# Look for test cases that might show the actual parameter formatting
fd "test.*blocken" -type f | xargs rg -l "def.*args" 2>/dev/null | head -3
```

Length of output: 299

🏁 Script executed:

```bash
# Let's look at actual test examples with *args formatting
rg -n "def.*\*args" tests/test_formatter.py -A10 -B2 | head -80
```

Length of output: 2188

🏁 Script executed:

```bash
# Check if the Argument class has any special handling for * operator
rg -n "class.*Argument|def.*format.*arg" snakefmt/blocken.py | head -20
```

Length of output: 1225
🏁 Script executed:

```bash
# Let's trace through the logic more carefully - what happens if a parameter line starts with *
# Create a simple test to see how the current code behaves
python3 << 'EOF'
import tokenize
import io

# Simulate a line starting with *args
test_line = "*args,"
tokens = list(tokenize.generate_tokens(io.StringIO(test_line).readline))
print(f"Tokens for '{test_line}':")
for tok in tokens:
    if tok.type != tokenize.ENCODING and tok.type != tokenize.ENDMARKER:
        print(f"  type={tokenize.tok_name[tok.type]}, string={repr(tok.string)}")

# Check the first token
first_token = [t for t in tokens if t.type not in (tokenize.ENCODING, tokenize.ENDMARKER, tokenize.NEWLINE)][0]
print(f"\nFirst token: type={tokenize.tok_name[first_token.type]}, string={repr(first_token.string)}")
print(f"First token string == '**': {first_token.string == '**'}")
print(f"First token string == '*': {first_token.string == '*'}")
print(f"First token type == tokenize.NAME: {first_token.type == tokenize.NAME}")
print(f"First token type == tokenize.NUMBER: {first_token.type == tokenize.NUMBER}")
EOF
```

Length of output: 363

🏁 Script executed:

```bash
# One final check - are there any existing cases where self.body[0].string is accessed without type checking?
rg -n "self\.body\[0\]\.string" snakefmt/blocken.py -B2 -A2
```

Length of output: 201

🏁 Script executed:

```bash
# Check if there might be cases where self.body is empty or self.body[0] doesn't have .string
rg -n "self\.body\[0\]" snakefmt/blocken.py -B1 -A1 | head -40
```

Length of output: 502
**Also handle the `*args` unpacking operator.**

The condition checks for `**kwargs` unpacking but not `*args`. Both should mark the start of a new parameter line; the missing check for `*` could cause `*args` parameters to be incorrectly appended to the previous parameter.
Consider:

Suggested fix

```diff
-if self.body[0].string == "**":
+if self.body[0].type == tokenize.OP and self.body[0].string in ("*", "**"):
     return True
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@snakefmt/blocken.py` around lines 233 - 235, Update the condition that
detects unpacking operators on the parameter token (currently checking
self.body[0].string == "**") to also treat the single-star unpacking operator as
a new parameter start; e.g., change the test to check for self.body[0].string ==
"*" or self.body[0].string == "**" (or use membership like self.body[0].string
in ("*", "**")) so both *args and **kwargs trigger the early-return that starts
a new parameter line.
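The suggested membership test works because Python's tokenizer emits both unpacking operators as single `OP` tokens. A quick standalone illustration (the `starts_new_parameter` helper below is hypothetical, written only to demonstrate the condition, and is not snakefmt's actual code):

```python
import io
import tokenize

def starts_new_parameter(line: str) -> bool:
    # First significant token of the line; NL/NEWLINE/INDENT are layout-only.
    first = next(
        tok
        for tok in tokenize.generate_tokens(io.StringIO(line).readline)
        if tok.type not in (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT)
    )
    # Both `*` and `**` arrive as single OP tokens, so one membership
    # test covers *args and **kwargs alike.
    return first.type == tokenize.OP and first.string in ("*", "**")

print(starts_new_parameter("*args,"))     # True
print(starts_new_parameter("**kwargs,"))  # True
print(starts_new_parameter("a=1,"))       # False
```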
```python
elif line.body[0].string == "@":
    headers = [line]
    while True:
        headers.append(tokens.next_new_line())
        if block := self.recognize(headers[-1].body[0]):
            break
    append_sub(block, headers)
```
**Potential infinite loop when decorators aren't followed by a recognized block.**

The `while True` loop on line 619 consumes lines after a decorator (`@`) until `self.recognize()` returns a block type. If the decorator is followed by unrecognized content (or EOF), the loop will continue until `TokenIterator.__next__` raises `UnsupportedSyntax`.
Consider adding a safety check:

```diff
 elif line.body[0].string == "@":
     headers = [line]
     while True:
         headers.append(tokens.next_new_line())
+        if headers[-1].end.type == tokenize.ENDMARKER:
+            raise UnsupportedSyntax(
+                f"Decorator at line {line.body[0].start[0]} not followed by def/class"
+            )
         if block := self.recognize(headers[-1].body[0]):
             break
     append_sub(block, headers)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@snakefmt/blocken.py` around lines 617 - 623, The decorator-handling loop can
hang if no recognized block follows; wrap the calls to tokens.next_new_line()
inside a try/except to catch the TokenIterator/UnsupportedSyntax EOF case (or
otherwise check tokens.has_next()/peek() before calling next_new_line), and on
EOF either raise a clearer error or break out and handle the accumulated headers
(call append_sub with an appropriate fallback or abort with a descriptive
exception). Update the block around line handling where you detect
line.body[0].string == "@" to use self.recognize, tokens.next_new_line, and
append_sub with this EOF/unsupported-syntax guard so the loop cannot run
forever.
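The proposed guard keys on `tokenize.ENDMARKER`, which the tokenizer always emits as the final token of a stream. A minimal sketch (the decorator source is made up) showing why that is a reliable EOF signal:

```python
import io
import tokenize

# A decorator with nothing after it: the pathological input for the loop above.
source = "@my_decorator\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
print([tokenize.tok_name[t.type] for t in tokens])
# → ['OP', 'NAME', 'NEWLINE', 'ENDMARKER']

# The stream always terminates with ENDMARKER, so checking the last
# consumed line for it detects EOF before looping again.
assert tokens[-1].type == tokenize.ENDMARKER
```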
I had a very initial attempt a couple of years ago at writing a grammar for snakemake and using Lark to build a parser that could then be used by all projects that need to parse snakemake code: https://github.com/mbhall88/snakemake-grammar. I think that code is probably not worth building on, but I did have a conversation with Johannes and he agreed this would be a wonderful first start. Also see which is the main hinge point for this issue. So rather than enhancing the snakefmt parser, I think we need to zoom out and get a broader snakemake grammar, and thus parser, made.
Thanks for the pointer to snakemake-grammar and the broader context. I agree completely that the right goal is a unified grammar that all snakemake-adjacent tools can build on, rather than each tool maintaining its own parser. I've compared the snakefmt and snakemake parser implementations and distilled that into the

On the question of lark specifically: I don't think it's the right foundation for a production snakemake parser, for a few reasons:

I think

The concrete split would be:

I'm planning to start a tree-sitter-snakemake repository to work this out. Happy to coordinate with you and Johannes on the grammar design if that's useful.
Motivation
The existing parser (`parser/parser.py`, `parser/syntax.py`) has grown organically and is now difficult to reason about: indent tracking, comment handling, parameter formatting, and `fmt:` directive handling are all interleaved in a single imperative loop. Adding new features (e.g. `# fmt: skip` on keyword lines, supporting main snakemake features) requires touching many tightly-coupled places.

Moreover, the compilation logic in the main snakemake repository and the planned LSP server (for code highlighting and symbol navigation) share similar parsing concerns. Maintaining them separately risks redundancy, drift, and subtle inconsistencies.
This PR rewrites the parser core as an explicit Block tree, making each concern addressable independently — bringing the architecture back under human control.
What changed
New: `snakefmt/blocken.py`

A self-contained parser that walks the token stream and produces a tree of typed `Block` objects. Each `Block` exposes:

- `segment2format() -> Generator[tuple[str, str | None], None, None]`: python/snakemake segments for combined formatting
- `compilation() -> str`: pure Python for snakemake execution
- `components() -> Iterator[DocumentSymbol]`: LSP document symbols (e.g. for VS Code highlighting)

Comment ownership and blank-line attribution are resolved once at tree construction time, not scattered across formatting passes.
Test for new design: `tests/test_blocken.py`

Unit tests for block parsing and `DocumentSymbol` extraction, independent of formatting output.

Modified: `tests/test_formatter.py`

A small number of edge-case behaviors were adjusted where the previous implementation had unintentional quirks.
I'd welcome review of these changes specifically.
If any adjusted behavior should be preserved, I'm happy to discuss or restore it.
What this enables / will support
- `# fmt: skip` on keyword lines: the `Block` and `LogicalLine` structure makes skip directives detectable before formatting begins, with no changes needed across multiple call sites
- `Block.compilation()` stubs are present; planned to expose compilation functions to the main snakemake repository
- `Block.components()` yields `DocumentSymbol`; planned to support a language server for code highlighting and symbol navigation
- once `Block.compilation()` is validated, snakemake could depend on snakefmt for parsing rather than maintaining parallel implementations

Questions for maintainer
- Does the `Block` hierarchy cover all formatting cases you're aware of?
- Are the adjusted `test_formatter.py` behaviors acceptable?
- Should we just wire `blocken.py` in to replace the existing logic, or first split it into smaller modules (e.g. `TokenIterator`, `LogicalLine`, `FormatState`/`format_black` as utilities, and the `Block` hierarchy as the core) and retire the existing `parser/` package?
parser/andformatterare untouched in this branch, andblocken.pyruns in parallel. The transition can be done incrementally.Summary by CodeRabbit
Release Notes
New Features
Chores